[MUSIC]

Hi everyone. Welcome back to the Heterogeneous Parallel Programming class. This is Lecture 1.6, Introduction to CUDA, where we will be discussing kernel-based SPMD parallel programming. The objective of this lecture is to help you learn the basic concepts involved in a simple CUDA kernel function: how we declare a kernel, how we use built-in variables in a kernel, and also to review how we can map a thread index into a data index.

Here is a very simple example of how we can declare a CUDA kernel. This is the vector addition kernel that we will eventually be launching from the host code. There are several important parts in this kernel. The first one is the keyword: every CUDA kernel function needs to be preceded by __global__, and this signals to the compiler that the function is a kernel function.

The actual header of the function is exactly the same as that of any other traditional C function. We have the name, in this case vecAddKernel, which is just the name we happened to give to the kernel, followed by all the parameters it expects from the caller.

Within the kernel we are able to use several built-in variables. The variables shown in this code example are threadIdx.x, blockDim.x, and blockIdx.x. These are all built-in variables that are initialized by the hardware, so every thread can access them. Depending on its thread index and block index, every thread will see a different combination of threadIdx and blockIdx values. That is why we are able to assign a different value of i to different threads. We have already gone through this concept in a previous lecture.

Finally, we can use i as a data index to access the input data. In this case, we will use i to read one A element and one B element, and to write into one C element.

We will come back and discuss the if statement, that is, why we need to test whether i is less than n, where n is the number of elements in the A, B, and C vectors.
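As a concrete reference, here is a minimal sketch of the kernel just described. The structure (the __global__ keyword, the index calculation, and the boundary check) follows the lecture; the float pointer parameter types are an assumption, since the transcript only names the vectors A, B, and C and the length n.

```cuda
// Minimal sketch of the vector addition kernel discussed above.
// The float* parameter types are an assumption; the lecture only
// names the vectors A, B, C and the element count n.
__global__ void vecAddKernel(float* A, float* B, float* C, int n) {
    // Map this thread's position in the grid to a data index.
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Boundary check: threads whose index falls outside the vectors
    // do nothing, so they cannot corrupt memory (discussed below).
    if (i < n) {
        C[i] = A[i] + B[i];
    }
}
```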
This slide shows the kernel launch code in the host code. In the previous lecture we talked about allocating device memory and copying data to the device, and we said that we would present the kernel launch part in this lecture; this is where it happens. So we are going to change the second part of the vecAdd function in the host code into a kernel launch.

The kernel launch is very similar to any C function call. We still have the function name, in this case vecAddKernel, which is just the name we assigned to the kernel. But between the function name and the parameters given to the function, we also need to supply a set of configuration parameters to the kernel.

We actually have two configuration parameters. The first one is the number of blocks in the grid, and the second one is the number of threads in a block. What we are saying in this particular statement is that we want 256 threads in each thread block, and we want ceil(n / 256.0) blocks in the grid.

What is happening here is that whenever we generate threads, we want to have enough threads to cover all the vector elements. If we have 256 vector elements, it is a very simple case: with 256 threads per block, we only need one block to cover all 256 elements, so we would have 1 and 256 as our configuration parameters. If n is equal to 256, we can simply divide it by 256 and get the value 1. If n is 512, it is also simple, because 512 divided by 256 is 2, so 2 thread blocks perfectly cover 512 elements.

The case that is much more complicated is when n is not a multiple of 256. Say n is 1000. When we divide 1000 by 256 we get 3 point some fraction, which puts us in a situation where we do not have a perfect multiple of 256. In order to have enough threads to cover 1000 elements, we actually need to generate four thread blocks, so that the 1024 threads altogether give sufficient coverage of the 1000 elements. But then we will end up with 24 extra threads.
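Here is a minimal sketch of what that host-side launch looks like, assuming d_A, d_B, and d_C are device pointers that were already allocated and filled as described in the previous lecture (the wrapper function name launchVecAdd is invented for this example).

```cuda
#include <math.h>

// Sketch of the launch step inside the host-side vecAdd function.
// Assumes d_A, d_B, d_C are device pointers that were already
// allocated and copied to, as described in the previous lecture.
void launchVecAdd(float* d_A, float* d_B, float* d_C, int n) {
    // First configuration parameter: ceil(n / 256.0) blocks in the grid.
    // Second configuration parameter: 256 threads per block.
    vecAddKernel<<<(int)ceil(n / 256.0), 256>>>(d_A, d_B, d_C, n);
}
```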
So in order to always generate enough threads, we divide n by 256.0, which produces a floating-point number, 3 point a fraction, and then we take the ceiling function to round up to 4. The ceiling function ensures that we always generate the next larger integer number of blocks, which will have enough threads to cover all the elements in the vector.

The reason we use 256.0 is that we need to make sure we generate a floating-point number, and it is that floating-point number that gets rounded up. If we did not use 256.0 but instead used 256, the division would be done as an integer division, so 1000 divided by 256 would produce the value 3, and the ceiling function would not be able to take it up to 4. So this is one way to ensure that we always have enough threads.

Now, going back to the previous slide. When we launch the kernel to process 1000 elements, we will have 4 thread blocks. The last thread block will have threads that cover up to element 999, which is the 1000th element because elements start from 0. However, we will also have 24 threads whose i value is greater than or equal to 1000, and these threads do not correspond to any real elements of A, B, and C. We do not want those threads to do anything; in fact, if they try to access memory, they can corrupt the memory contents. So what we want to do is test whether the i value of each thread is actually less than n, which is 1000 here. Only threads that pass this test should do the element addition. This is the reason we need this check, and it is commonly referred to as a boundary check: it tests whether a thread is supposed to process an element within the boundary or outside the boundary. Whenever a thread corresponds to an element outside the boundary, that thread will not do anything in the computation.
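To make the integer-versus-floating-point division point concrete, here is a tiny worked example for n = 1000 (a standalone sketch, not code from the lecture).

```cuda
#include <math.h>
#include <stdio.h>

// Worked example of the division pitfall discussed above, for n = 1000.
int main(void) {
    int n = 1000;
    int blocksTruncated = n / 256;              // integer division: 3 blocks, too few
    int blocksRoundedUp = (int)ceil(n / 256.0); // 1000 / 256.0 = 3.90625, ceil gives 4
    printf("%d %d\n", blocksTruncated, blocksRoundedUp);  // prints "3 4"
    return 0;
}
```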
This slide shows two additional concepts for the kernel launch. The first one is that we can actually have two-dimensional and three-dimensional grids and blocks. The particular method we showed on the previous slide only works for launching one-dimensional grids, because each of those configuration parameters is just an integer value. Here we show a general method for launching a grid of two dimensions, three dimensions, or, in some cases, one dimension.

We now need to declare two variables. They are both of type dim3, and dim3 is available in the CUDA header file. dim3 simply means that a variable has three parts, each of which is an integer value. We can initialize such a variable with the syntax shown here, with the variable name followed by parentheses: the first value initializes the x value, the second value initializes the y value, and the third value initializes the z value. In this case, because we are using a one-dimensional grid, we only need to initialize the x values of both DimGrid and DimBlock. These variable names are your choice; you can call them anything you want, as long as you use them in the right positions of the kernel launch. Here we call DimGrid the configuration for the number of blocks in the grid, and DimBlock the number of threads in a thread block. So we can use this as a more general way of launching kernels.

The second concept is an alternative way to generate enough thread blocks to cover all the elements. Here you can use (n minus 1) divided by 256, where 256 can now be an integer, plus 1. The division gives an integer value, and adding 1 always gives the next larger integer number of blocks that will cover all the elements. Going back to the 1000-element example: 1000 minus 1 is 999, and if you divide 999 by 256 you get 3, and 3 plus 1 is 4. The most important case is, let's say, 256: 256 minus 1 is 255, so you get 0 from that division, but then we add 1, so every n from 1 to 256 gives us exactly one block under this formula.
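Here is a sketch of this more general, dim3-based launch, again assuming the same vecAddKernel and device pointers as before (the wrapper function name is invented for illustration).

```cuda
// Sketch of the dim3-based launch described above, assuming the same
// vecAddKernel and device pointers d_A, d_B, d_C as in the earlier example.
void launchVecAddGeneral(float* d_A, float* d_B, float* d_C, int n) {
    // (n - 1) / 256 + 1 uses integer division and still rounds up: it is
    // the smallest number of 256-thread blocks that covers n elements.
    dim3 DimGrid((n - 1) / 256 + 1, 1, 1);  // blocks in the grid (x, y, z)
    dim3 DimBlock(256, 1, 1);               // threads per block (x, y, z)
    vecAddKernel<<<DimGrid, DimBlock>>>(d_A, d_B, d_C, n);
}
```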
If you work out all these possibilities, you will realize that this expression always gives you exactly the correct number of thread blocks, big enough to cover all the elements.

So now let's put everything together. Kernel execution works as follows. In the host code, you have a host function, and that host function can launch a kernel by calling the kernel function just like any other C function, but with additional configuration parameters. The configuration parameters need to be preceded by three less-than signs and followed by three greater-than signs. Once we launch a kernel, that is, call the kernel function, the hardware will generate a grid of threads to execute the kernel function; every thread will execute this kernel function. The kernel function is declared in the CUDA program by preceding the function declaration with the keyword __global__. This allows the host code and the device code to connect, so that whenever the host code launches this kernel, it generates this grid of threads. All the threads in the grid will have the built-in variables blockIdx.x, blockDim.x, and threadIdx.x, and these predefined, preinitialized variables allow the threads to generate different data indices, so that each thread processes a different part of the data.

Now I would like to summarize how you can treat all the function declarations in your CUDA program. We have already talked about the kernel function and the host function, but there are really three types of functions in a CUDA program. The first type is the one we have not talked about so far, the device function. It is declared with __device__, and this keyword tells the compiler that the function can be executed only on the device and can be called only by another function that is executing on the device. This kind of function is typically called by a kernel function or by other device functions.
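As a hypothetical illustration of this first type (not code from the lecture), here is a small __device__ helper called from a kernel; the names addOne and incrementKernel are invented for the example.

```cuda
// Hypothetical illustration of a __device__ function: it runs on the
// device and is called from code already running on the device.
// The names addOne and incrementKernel are invented for this example.
__device__ float addOne(float x) {
    return x + 1.0f;
}

__global__ void incrementKernel(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        data[i] = addOne(data[i]);  // device code calling a device function
    }
}
```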
The second type is the one we have talked about the most so far, the kernel function. The kernel function is a very special device function: it can be executed on the device, but it can only be called from the host. This is really the gateway to parallel execution; when the host code wants to trigger any kind of parallel execution, it launches a kernel function. The third type is the host function. It can optionally be preceded by __host__. These functions are just traditional C functions that are executed on the host and can only be called from the host.

A few important details. __global__ declares a kernel function; make sure you remember that there are two underscores before and two underscores after the word. A kernel function must return void, because these functions are special: they are really used to trigger parallel execution on the device, and there is no easy, simple way to return a value. The question would be which thread should even give you that value back. So, to keep things simple, kernel functions must always return void.

The second detail is that __device__ and __host__ can be used together, and this gives you a function that is compiled both ways: it is compiled into a host function, and it is also compiled into a device function. This way the function can be used appropriately on both the host and the device, and it is often used for utility functions that are useful on both.

Finally, if a function is truly only going to be a host function, then __host__ is optional. This allows you to port a C program into CUDA very quickly, because all the original functions can remain as they are, with no keywords, and they will be assumed to run only on the host. You can then gradually add keywords to the function declarations to turn some of them into kernel functions or device functions, or to make some functions both host and device functions.
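As a hypothetical example of the combined qualifiers mentioned above (not from the lecture), a small utility compiled for both sides might look like this; the name clampf is invented.

```cuda
// Hypothetical example of __host__ __device__: this utility is compiled
// twice, once as a host function and once as a device function, so it
// can be called from either side. The name clampf is invented here.
__host__ __device__ float clampf(float x, float lo, float hi) {
    return x < lo ? lo : (x > hi ? hi : x);
}
```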
Now that we know how to declare a kernel, how to launch a kernel, and so on, I would like to tell you a little bit more about how a CUDA program is compiled.

In any file we can have both host functions and device functions, so NVCC, the NVIDIA C compiler, will automatically identify all the host functions and kernel functions and separate them into two compilation paths. The host functions go through the host code path, where they are compiled and linked by a host C compiler, such as GCC or Intel ICC. The second path is for the device functions and kernel functions. These functions are compiled into a binary file format called PTX, which is very similar to Java bytecode. At run time, a just-in-time compiler takes the PTX binary and generates the real ISA (instruction set architecture) binary for the particular device you are using. Both types of executables are then used on the heterogeneous computing platform: the host binary runs on the host, which is the CPU, and the device binary runs on the device hardware, such as a GPU. This allows you to take a single program with host and device functions and run it correctly on a heterogeneous computing platform.

So at this point, we have finished the basic concepts involved in writing a simple, basic CUDA program. I would like to encourage you to take the lab tour; the lab tour will show you the environment you will use to write the CUDA kernels for all the assignments, and it will also show you how to get the instructions and how to submit your assignments. Once you finish that tour, I would like to encourage you to start the first lab assignment, which is the vector addition CUDA code. This will help you solidify your understanding of all the concepts that we have discussed to date. For those of you who would like to learn more about the CUDA language at the introductory level, I recommend that you read Chapter 3 of the textbook. Thank you.